Abstract: Three models for linear regression clustering are given, and
corresponding methods for classification and parameter estimation are developed
and discussed: the mixture model with fixed regressors (ML-estimation), the
fixed partition model with fixed regressors (ML-estimation), and the mixture
model with random regressors (Fixed Point Clustering). The number of clusters
is treated as unknown. The approaches are compared via an application to
Fisher's Iris data. In the process, a widely ignored feature of these data is
discovered.

1 Introduction 

Cluster analysis problems based on stochastic models can be divided into two
classes:

1. A cluster is considered as a subset of the data points, which can be
modeled adequately by a distribution from a class of cluster reference
distributions (c.r.d.). These distributions are chosen to reflect
the meaning of homogeneity with respect to the particular data analysis
problem. Therefore c.r.d. are often unimodal. If the class of c.r.d. is
parametric, then one is interested in classification of the data points
and parameter estimation within each cluster.

2. A cluster is considered as an area of high density of the distribution of 
the whole dataset. No distributional assumption is made for the single 
clusters. 

Clusterwise linear regression is a problem of the first kind since the points of
each cluster are considered to be generated according to some linear regression
relation, i.e. one imagines a separate model for each cluster. The class
of c.r.d. for the regression clustering problem contains distributions of the
following kind: Consider a dataset $Z = (x_i, y_i)_{i \in I}$,
$x_i \in \{1\} \times \mathbb{R}^p$, $y_i \in \mathbb{R}$, $I$ being some index set.

$$\mathcal{L}(y_i \mid x_i) = F_{(x_i,\beta,\sigma^2)}, \quad \text{defined by}$$
$$y_i = x_i'\beta + u_i, \quad \mathcal{L}(u_i) = \mathcal{N}(0,\sigma^2), \quad (\beta,\sigma^2) \in \mathbb{R}^{p+1} \times \mathbb{R}^+.$$

The first component of $\beta$ denotes the intercept. The $u_i$, $i \in I$, are considered
to be stochastically independent. The $x_i$ are called regressors in the
following. They can be fixed or random with $\mathcal{L}(x_i) = G$ from some class of
distributions $\mathcal{G}$. In the latter case the regressors are assumed to be i.i.d. and
independent of $(u_i)_{i \in I}$. $F_{(G,\beta,\sigma^2)}$ then denotes the joint distribution of $(x_i, y_i)$.
In our setup, all parameters are considered as unknown.
The models will be divided into fixed and random regressor models, and 
into mixture and fixed partition models. Mixture models treat the cluster 
membership of a point as random, fixed partition models contain parameters 
for the cluster membership of each point. A fixed partition model with 
random regressors will not be given because this does not lead to an easy 
clustering method. The purpose of the model based approach presented 
here is not to describe the mechanism generating the data, but to find an 
adequate description of the data themselves. Thus, all models can be applied
to the same data. In particular, the question of whether the regressors were
really fixed or random is ignored.

The literature on clusterwise linear regression either treats the mixture
model with fixed regressors (e.g. Quandt and Ramsey (1978), for general p
and number of clusters DeSarbo and Cron (1988)) or discusses algorithms
for a least squares solution (e.g. Bock (1969), Spaeth (1979)) which is
related to the fixed regressors fixed partition model presented here in the case
of equal error variances for each cluster. This paper is based on the
unpublished dissertation Hennig (1997b) where simulation results and proofs are
given in full detail.



2 Fixed regressors, mixture model 

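Let $I$ be an index set, usually $I = \{1,\ldots,n\}$. With a given regressor design
$(x_i)_{i \in I} \in (\{1\} \times \mathbb{R}^p)^n$, the fixed regressors mixture model (FRM) is
defined by

$$\mathcal{L}\big((y_i)_{i \in I}\big) = \bigotimes_{i \in I} \sum_{j=1}^{s} \epsilon_j F_{(x_i,\beta_j,\sigma_j^2)}, \quad \sum_{j=1}^{s} \epsilon_j = 1, \quad \epsilon_j > 0, \; j = 1,\ldots,s.$$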

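That is, $s$ denotes the number of clusters and $\epsilon_j$ denotes the proportion of
the cluster $j$. The log-likelihood function

$$\ln L_n\big(s, (\epsilon_j,\beta_j,\sigma_j^2)_{j=1,\ldots,s}, Z\big) = \sum_{i \in I} \ln \left( \sum_{j=1}^{s} \epsilon_j \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left( -\frac{(y_i - x_i'\beta_j)^2}{2\sigma_j^2} \right) \right)$$
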
can be locally maximized for given $s$ via the EM-algorithm described in
DeSarbo and Cron (1988). This works only subject to $\sigma_j^2 \geq c \; \forall j$ with
some lower bound $c > 0$ (e.g. $c = 0.001$) since otherwise $\ln L_n$ would be
unbounded. After the algorithm has been run, point $i$ can be classified
to cluster $\hat\gamma(i) \in \{1,\ldots,s\}$ according to

$$\hat\gamma(i) = \arg\max_j \hat\epsilon_{ij}, \quad \hat\epsilon_{ij} = \frac{\hat\epsilon_j f_{(x_i,\hat\beta_j,\hat\sigma_j^2)}(y_i)}{\sum_{k=1}^{s} \hat\epsilon_k f_{(x_i,\hat\beta_k,\hat\sigma_k^2)}(y_i)},$$

where $f_{(x_i,\beta,\sigma^2)}$ denotes the density of $F_{(x_i,\beta,\sigma^2)}$.
$\hat\epsilon_{ij}$ denotes the estimated a posteriori probability for point $i$ to be generated
by mixture component $j$.
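
The EM iteration and the classification rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation of DeSarbo and Cron (1988): the names `frm_em` and `_posteriors`, the start from small random subsets, and the fixed number of iterations are choices made here. Only the variance bound c = 0.001 is taken from the text.

```python
import numpy as np


def _posteriors(X, y, eps, beta, sigma2):
    """E-step: a posteriori probabilities and log-likelihood for given parameters."""
    resid = y[:, None] - X @ beta.T                      # shape (n, s)
    log_dens = -0.5 * (np.log(2 * np.pi * sigma2) + resid**2 / sigma2)
    log_w = np.log(eps) + log_dens
    log_norm = np.logaddexp.reduce(log_w, axis=1)        # log mixture density per point
    return np.exp(log_w - log_norm[:, None]), log_norm.sum()


def frm_em(X, y, s, c=1e-3, n_iter=200, seed=0):
    """EM sketch for a mixture of s linear regressions with fixed regressors.

    X is the (n, p+1) design matrix whose first column is 1 (intercept).
    """
    rng = np.random.default_rng(seed)
    n, q = X.shape
    # Assumed start: each component is fitted to q + 1 randomly drawn points.
    beta = np.empty((s, q))
    for j in range(s):
        idx = rng.choice(n, size=min(q + 1, n), replace=False)
        beta[j] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    sigma2 = np.full(s, max(np.var(y), c))
    eps = np.full(s, 1.0 / s)
    for _ in range(n_iter):
        post, _ = _posteriors(X, y, eps, beta, sigma2)
        # M-step: weighted least squares and weighted variance per component
        for j in range(s):
            w = post[:, j]
            sw = np.sqrt(w)
            beta[j] = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
            r = y - X @ beta[j]
            # the lower bound c keeps the likelihood bounded
            sigma2[j] = max(np.sum(w * r**2) / max(w.sum(), 1e-300), c)
        eps = np.maximum(post.mean(axis=0), 1e-300)
    post, loglik = _posteriors(X, y, eps, beta, sigma2)
    gamma = post.argmax(axis=1)                          # classification rule
    return eps, beta, sigma2, post, gamma, loglik
```

Since EM only finds a local maximum, one would typically call `frm_em` with several values of `seed` and keep the run with the largest returned log-likelihood.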

The consistency proofs for FRM-ML estimation (Kiefer (1978), DeSarbo and 
Cron (1988)) suffer from not taking possible identifiability problems (Hennig 
(1996)) into account. 

DeSarbo and Cron (1988) suggest Akaike's Information Criterion (AIC) for
the estimation of $s$:

$$\hat{s} := \arg\max_s \, \ln L_n(s) - k(s), \quad k(s) = (p+3)s - 1,$$

$k(s)$ denotes the number of free parameters to estimate for the cluster number
$s$ and $\ln L_n(s)$ is the estimated maximum log-likelihood. Their simulations
do not treat the performance of this proposal. The simulations of
Hennig (1997b) show the tendency of the AIC to overestimate a small number
of clusters. Schwarz' Criterion (SC) gives smaller estimates of $s$ for
$n > e^2$ (where $\frac{\ln n}{2} > 1$) and seems to work better:

$$\hat{s} := \arg\max_s \, \ln L_n(s) - \frac{\ln n}{2} k(s).$$

The discussion of the Iris data example in section 5 illustrates this performance.
Up to now there are no theoretical results on the performance of the
AIC and SC for linear regression mixtures.
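
The two criteria differ only in the penalty factor, which the following sketch makes explicit. The function names and the log-likelihood values in the usage example are made up for illustration; they are not results from the paper.

```python
import math


def n_params_frm(s, p):
    """Free parameters of the FRM: s*(p+1) coefficients, s variances, s-1 proportions."""
    return (p + 3) * s - 1


def select_s(logliks, n, p, criterion="SC"):
    """Return the s maximizing the penalized log-likelihood.

    logliks maps each candidate s to its maximized log-likelihood ln L_n(s).
    """
    def score(s):
        k = n_params_frm(s, p)
        factor = 1.0 if criterion == "AIC" else 0.5 * math.log(n)
        return logliks[s] - factor * k

    return max(logliks, key=score)


# Hypothetical log-likelihood values, for illustration only.
logliks = {1: -200.0, 2: -150.0, 3: -144.0, 4: -141.0}
print(select_s(logliks, n=150, p=1, criterion="AIC"))  # 3
print(select_s(logliks, n=150, p=1, criterion="SC"))   # 2
```

With these made-up values the heavier SC penalty selects fewer clusters than the AIC, which is the direction of the difference described above.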

Some alternative proposals for parameter estimation within this model were
made (e.g. Quandt and Ramsey (1978)), but they lead to greater numerical
difficulties and were investigated only for $s = 2$, $p = 1$.

Figure 1: Assignment independence - assignment dependence

The implicit assumption of assignment independence is a disadvantage of
the FRM. That is, the clusters keep the same proportions $\epsilon_j$, $j = 1,\ldots,s$, for
every fixed regressor $x_i$ (see figure 1). The probability of a point $(x_i, y_i)$ to be
generated by cluster $j$ is independent of $x_i$ and $i$. This is not generally true.
For example in a change point setup, the cluster membership is considered
as determined by $x_i$ or $i$. Methods concerning this particular assumption
can be found e.g. in Krishnaiah and Miao (1988). Also for the Iris data in
section 5, assignment independence does not seem to hold.
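
The difference between the two situations can be made concrete with a small simulation. The sketch below is illustrative only: the function name, the two regression lines, the noise level and the change point at x = 5 are all hypothetical choices, not taken from the paper.

```python
import numpy as np


def simulate_two_lines(n, dependent, seed=0):
    """Simulate two regression clusters with or without assignment dependence."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 10, n)
    if dependent:
        # change point setup: membership is determined by the regressor
        gamma = (x > 5).astype(int)
    else:
        # assignment independence: same proportions at every x
        gamma = rng.binomial(1, 0.5, n)
    beta = np.array([[0.0, 1.0], [8.0, -0.5]])   # hypothetical (intercept, slope) pairs
    X = np.column_stack([np.ones(n), x])
    y = np.sum(X * beta[gamma], axis=1) + rng.normal(0, 0.3, n)
    return X, y, gamma
```

Plotting `y` against `X[:, 1]` for both settings gives a picture in the spirit of figure 1: under independence both lines are populated over the whole range of x, under the change point setup each line occupies its own range.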

3 Fixed regressors, fixed partition model 

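In the fixed partition approach, the cluster membership of each point $i$ is
indicated by a parameter $\gamma(i)$. Thus, general kinds of assignment dependence
can be modeled. The fixed regressors fixed partition model (FRFP) is
given by

$$\mathcal{L}\big((y_i)_{i \in I}\big) = \bigotimes_{i \in I} F_{(x_i,\beta_{\gamma(i)},\sigma_{\gamma(i)}^2)}, \quad \gamma: I \to \{1,\ldots,s\},$$

$(x_i)_{i \in I} \in (\{1\} \times \mathbb{R}^p)^n$ again given fixed. Under known $s$, ML-estimation is also
possible within this model. The log-likelihood function is given by

$$\ln L_n\big(s, \gamma, (\beta_j,\sigma_j^2)_{j=1,\ldots,s}, Z\big) = -\frac{1}{2} \sum_{j=1}^{s} \sum_{\gamma(i)=j} \left( \ln 2\pi + \ln \sigma_j^2 + \frac{(y_i - x_i'\beta_j)^2}{\sigma_j^2} \right). \quad (1)$$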

For given $(\beta_j,\sigma_j^2)_{j=1,\ldots,s}$, (1) is maximized according to

$$\gamma(i) = \arg\min_j \left[ \ln \sigma_j^2 + \frac{(y_i - x_i'\beta_j)^2}{\sigma_j^2} \right]. \quad (2)$$

For given $\gamma$, (1) is the sum of the usual log-likelihood functions for
homogeneous linear regressions within each cluster. Therefore, it is maximized by
the LS-estimator $\hat\beta_j$ from the points $(x_i, y_i)$ with $\gamma(i) = j$ and

$$\hat\sigma_j^2 := \frac{\sum_{\gamma(i)=j} (y_i - x_i'\hat\beta_j)^2}{n_j}, \quad n_j := \sum_{i \in I} 1(\gamma(i) = j), \quad j = 1,\ldots,s. \quad (3)$$

That is, $\ln L_n$ increases monotonically if the steps (2) and (3) are carried
out alternately. This algorithm leads to a local maximum in finitely many
steps since there are only finitely many choices for $\gamma$. In my experience,
this is noticeably the fastest algorithm discussed in this paper. Under
$\sigma_1^2 = \ldots = \sigma_s^2$, the procedure is equivalent to the least squares algorithm of Bock
(1969).
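
The alternation of steps (2) and (3) can be sketched as follows. This is a minimal illustration, not the author's code: the function name `frfp_ml`, the handling of empty clusters (they are simply excluded from reassignment) and the default bound `c` are assumptions made here.

```python
import numpy as np


def frfp_ml(X, y, gamma0, s, c=1e-3, max_iter=100):
    """Alternate steps (3) and (2) from a starting partition gamma0 (labels 0..s-1)."""
    gamma = np.asarray(gamma0).copy()
    n, q = X.shape
    beta = np.zeros((s, q))
    sigma2 = np.ones(s)
    for _ in range(max_iter):
        used = np.zeros(s, dtype=bool)
        # Step (3): LS fit and ML variance within each current cluster
        for j in range(s):
            members = gamma == j
            if not members.any():
                continue                      # assumption: empty clusters are dropped
            used[j] = True
            beta[j] = np.linalg.lstsq(X[members], y[members], rcond=None)[0]
            r = y[members] - X[members] @ beta[j]
            sigma2[j] = max(np.mean(r**2), c)  # lower bound on the error variance
        # Step (2): reassign every point to the cluster minimizing the criterion
        crit = np.log(sigma2) + (y[:, None] - X @ beta.T) ** 2 / sigma2
        crit[:, ~used] = np.inf
        new_gamma = crit.argmin(axis=1)
        converged = np.array_equal(new_gamma, gamma)
        gamma = new_gamma
        if converged:
            break
    # value of (1) for the returned partition and parameters
    loglik = -0.5 * np.sum(np.log(2 * np.pi) + crit[np.arange(n), gamma])
    return gamma, beta, sigma2, loglik
```

As with the EM-algorithm, the result is only a local maximum, so several starting partitions would normally be tried and the run with the largest `loglik` kept.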

There is some literature that compares mixture and fixed partition
approaches applied to location-scale and especially Gaussian distributed clusters
(e.g. Bryant and Williamson (1986)). Analogously to the location-scale case
it can be shown that FRFP-ML leads to inconsistent parameter estimators.
This does not matter in practice if the clusters are well separated, but
causes serious problems otherwise. Like FRM-ML, FRFP-ML needs some
lower bound on the error variance parameters since otherwise $\ln L_n$ would
be unbounded.

The approaches for the estimation of $s$ discussed in section 2 are not reasonable
here because the number of parameters $\gamma(i)$ increases with $n$ and their
value range increases with $s$. The following modification of the SC worked
very well in simulations:

$$\hat{s} := \arg\max_s \, \ln L_n(s) - \frac{\ln n}{2} k(s) - 0.7sn, \quad k(s) := (p+2)s, \quad (4)$$

$k(s)$ denoting the number of regression and scale parameters.

4 Random regressors, mixture model 

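Random regressors have the advantage that the observations can be treated
as i.i.d. The random regressors mixture model (RRM) has the following
form: $(x_i, y_i) \in \{1\} \times \mathbb{R}^p \times \mathbb{R}$, $i \in I$, are distributed i.i.d. according to

$$\mathcal{L}(x, y) = \sum_{j=1}^{s} \epsilon_j F_{(G_j,\beta_j,\sigma_j^2)}, \quad \sum_{j=1}^{s} \epsilon_j = 1,$$

that is, $\mathcal{L}(x) = G_j$ within cluster $j$. Suitable choices for $G_j$, $j = 1,\ldots,s$,
enable us to model every kind of assignment (in-)dependence. Usually the $G_j$
are not of interest, but unknown. For performing ML-estimation, there needs
to be a parametric specification of the regressor distributions. This will not
be discussed here. A more general approach is presented instead. The RRM is a
special case of the contamination model (CM) (choose
$F^* = \sum_{j=2}^{s} \frac{\epsilon_j}{\epsilon} F_{(G_j,\beta_j,\sigma_j^2)}$,
$(G,\beta,\sigma^2) = (G_1,\beta_1,\sigma_1^2)$, $\epsilon = 1 - \epsilon_1$ below):

$$\mathcal{L}(x, y) = (1 - \epsilon) F_{(G,\beta,\sigma^2)} + \epsilon F^*, \quad 0 \leq \epsilon < 1, \quad G \in \mathcal{G}. \quad (5)$$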

There is a basic difference between the CM and the former models. The
parameters $(G,\beta,\sigma^2)$ are clearly not unique in (5) since they can correspond
to $(G_j,\beta_j,\sigma_j^2)$ of the RRM for each $j$. Further, if $F^*$ is not assumed to be of
a mixture type, the CM allows for outliers, i.e. points in the data which do
not belong to any regression population. In robust statistics, the CM with
$\epsilon < \frac{1}{2}$ is a standard tool to describe the occurrence of outliers.
A method to analyze the CM should find possible choices for $(\beta,\sigma^2)$ ($G$ is
treated as nuisance) and therefore needs no specification of some number of
clusters.

This goal can be achieved by means of Fixed Point Clustering. The idea 
of this approach is that a data subset, which contains no outliers, can be 
viewed as homogeneous. If at the same time all other points of the dataset 
are outliers with respect to the subset, then the subset is separated from the 
rest and can be considered as a cluster. 
For an indicator vector $g \in \{0,1\}^n$ define $Z(g) := (x_i, y_i)_{g_i = 1}$.

Definition: $Z(g)$ is called Fixed Point Cluster (FPC) w.r.t. $Z$, iff $g$ is
a fixed point of

$$f: \{0,1\}^n \to \{0,1\}^n,$$
$$f_i(g) = 1\left[ \big(y_i - x_i'\hat\beta(Z(g))\big)^2 \leq c\,\hat\sigma^2(Z(g)) \right], \quad i = 1,\ldots,n,$$

with some prechosen constant $c$ (e.g. $c = 10$). $\hat\beta(Z(g))$ and $\hat\sigma^2(Z(g))$ are
regression parameter and error variance estimators based only on the data
subset $Z(g)$, e.g. the ML-estimators from (3).

The function $f$ is an inverted outlier identifier (0 for outliers) based on the
random regressor linear regression model. That is, a point is considered as
an outlier w.r.t. $F_{(G,\beta,\sigma^2)}$ if it falls into the outlier region $\{(y - x'\beta)^2 > c\sigma^2\}$
(see Davies and Gather (1993) for the concept of model based outlier regions).
Therefore an FPC $Z(g)$ is exactly the set of non-outliers in $Z$ w.r.t.
$Z(g)$ and can be interpreted as the set of "ordinary observations" generated
by some member of the c.r.d.-family.

The method is similar to some procedures for robust regression where the
goal is to find a solution of $\sum_{i \in I} \rho\left( \frac{(y_i - x_i'\beta)^2}{\sigma^2} \right) = \min_\beta$. The function $\rho$ also
provides some kind of outlier identification. Local minima could be interpreted
as parameters for clusters (Morgenthaler (1990)), but the choice of
$\sigma^2$ is not clear and a robust estimator would depend on at least half of the
data. This is not adequate for cluster analysis.

FPCs can be computed with the usual fixed point algorithm ($g^{k+1} = f(g^k)$)
which converges in finitely many steps (proven in Hennig (1997b)). In order
to find all relevant clusters, the algorithm must be started many times with
various starting vectors $g$. A complete search is numerically impossible.
However, this also holds for the other two methods unless one is satisfied
with a local maximum of unknown quality of the log-likelihood function.
The FPC methodology does not force a partition of the dataset. Non-disjoint
FPCs are possible, as are points which do not belong to any FPC. Accordingly,
FPCs are an exploratory tool rather than a parameter estimation
procedure in the case of a valid partition or mixture model.
The application of FPC analysis to more general situations is discussed in
Hennig (1997a), Hennig (1998).
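
The fixed point iteration itself is short. The sketch below assumes the ML-estimators from (3) on the current subset and c = 10 as in the definition; the function names, the `max_iter` safeguard and the treatment of an empty indicator vector are assumptions made here for illustration.

```python
import numpy as np


def fpc_map(X, y, g, c=10.0):
    """One application of f: refit on Z(g), then flag all non-outliers in Z."""
    members = np.asarray(g, dtype=bool)
    if not members.any():
        return np.zeros(len(y), dtype=int)    # assumption: empty subset stays empty
    beta = np.linalg.lstsq(X[members], y[members], rcond=None)[0]
    sigma2 = np.mean((y[members] - X[members] @ beta) ** 2)
    return ((y - X @ beta) ** 2 <= c * sigma2).astype(int)


def fixed_point_cluster(X, y, g0, c=10.0, max_iter=1000):
    """Iterate g <- f(g) from the starting vector g0 until a fixed point is reached."""
    g = np.asarray(g0, dtype=int)
    for _ in range(max_iter):
        g_new = fpc_map(X, y, g, c)
        if np.array_equal(g_new, g):
            return g
        g = g_new
    return None                               # safeguard only; not expected to occur
```

In practice `fixed_point_cluster` would be called with many random starting vectors, and the distinct fixed points found would be collected as candidate clusters.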

5 Iris data example and comparison



Fisher's Iris data (Fisher (1936)) consists of four measurements of three
species of Iris plants. The measurements are sepal width (SW), sepal length
(SL), petal width (PW) and petal length (PL). The species are Iris setosa
(empty circles in figure 2a), Iris virginica (filled circles) and Iris versicolor
(empty squares). Each species is represented by 50 points. Originally, the
classification of the plants was not a regression problem. The dataset is used
for illustrative purposes here. A more "real world" but less illustrative
example can be found in Hennig (1998). Only the variables SW and PW are
considered. PW is modeled as dependent on SW. The distinction between
"regressor" and "dependent variable" is artificial. The methods use no
information about the real partition. By eye, the setosa plants are clearly
separated from the other two species, while virginica and versicolor overlap.
A linear regression relation between SW and PW seems to be appropriate within
each of the species.

Figure 2: Iris data: a) original species - b) FRM-ML clusters with SC

Figure 3: a) FRFP-ML clusters - b) Fixed Point Clusters

Using the SC for estimating the number of clusters, FRM-ML-estimation
finds the four clusters shown in figure 2b. Three clusters correspond to
the three species. FRM-ML is the only method which provides a rough
distinction between the virginica and versicolor plants. The fourth cluster
(crosses in figure 2b) is some kind of "garbage cluster". It contains some
points which are not fitted well enough by one of the other three regression
equations. Note that the deviation from assignment independence of the
four cluster solution seems to be lower than that of the original partition
of the species. The AIC for estimating the number of clusters leads to five
clusters by removing further points from the three large clusters and building
a second garbage cluster.

By application of (4), the number of clusters is estimated as 2. Figure 3a
shows the ML-classification using the FRFP. It corresponds to the most
natural^1 eye-fit. The well separated setosa plants form a cluster, the other
two species are put together.

With 150 randomly chosen starting vectors, four FPCs are found. The first 
contains the whole dataset. This happens usually and is an artifact of the 
method. One has to know that to interpret the results adequately. The 
second and third cluster correspond to the setosa plants and the rest of the 
data, respectively. The point labelled by a cross falls in the intersection 
of both clusters and is therefore indicated as special. The fourth cluster is 
labelled by empty squares and consists of 29 points from the setosa cluster, 
which lie exactly on a line because of the rounding of the data. The other 
methods are not able to find this constellation because of the lower bounds 
on the error variances. 

After having noticed this result, one realizes that there are other groups of
points which lie exactly on a line, and which are not found by the random
search of Fixed Point Clustering since they are too small. The fourth FPC
contains more than half of the setosa species^2 and is therefore a remarkable
feature of the Iris data.

The results from the Iris data highlight the special characteristics of the 
three methods. The simulation study of Hennig (1997b) leads to similar 
conclusions. 

FRM-ML-estimation is the best procedure if assignment independence
holds and if the clusters are not well separated. On the Iris data, it
can discriminate between virginica and versicolor. The stress is on
regression and error variance parameter estimation.

FRFP-ML-estimation is the best procedure under most kinds of assignment
dependence to find well separated clusters if there is a clear
partition of the dataset. On the Iris data, the procedure finds the visually
clearest constellation. The stress is on point classification.

Fixed Point Clustering is the best procedure to find well separated clus- 
ters if outliers or identifiability problems (Hennig (1996)) exist. Its 
stress is on exploratory purposes. By means of Fixed Point Clustering, 
the discovery that a large part of the setosa cluster lies exactly on a 
line was made. 

^1 It is not clear what "most natural" means, but this is the impression of the author.
^2 One cannot see 29 squares because some of the points are identical.